System and method for sequential probabilistic object classification

ABSTRACT

Methods and systems are provided for classifying an object appearing in multiple sequential images. The process includes determining a neural network classifier having multiple object classes for classifying objects in images; determining a likelihood classifier model comprising a likelihood vector of class probability vectors; for each image z, running the image multiple respective times through the neural network classifier, applying dropout each time, to generate a point cloud of class probability vector values {γt}; calculating a vector of posterior distributions {λt} for each class and for each of the multiple {γt}, where calculating each class element of {λt} includes calculating a product of the respective element of the class probability vectors and an element of the posterior distribution of a prior image; randomly selecting a subset of {λt} to form a new subset of {λt}; and repeating the calculation of the subset {λt} for each of the images, to determine a cloud of posterior probability vectors approximating a distribution over posterior class probabilities, given all the multiple sequential images.

FIELD OF THE INVENTION

The present invention relates to image processing for machine vision.

BACKGROUND

Classification and object recognition is a fundamental problem in robotics and computer vision, a problem that affects numerous problem domains and applications, including semantic mapping, object-level SLAM, active perception and autonomous driving. Reliable and robust classification in uncertain and ambiguous scenarios is challenging, as object classification is often viewpoint dependent, influenced by environmental visibility conditions such as lighting, clutter, image resolution and occlusions, and limited by a classifier's training set. In these challenging scenarios, classifier output can be sporadic and highly unreliable. Moreover, approaches that rely on most likely class observations can easily break, as these observations are treated equally regardless of whether the most likely class has high probability or not, potentially giving large significance to ambiguous observations. Indeed, modern (deep learning based) classifiers provide much richer information that is being discarded by resorting to only most likely observations. Current convolutional neural network (CNN) classifiers provide not only a vector of class probabilities (i.e. a probability for each class), but, recently, also output an uncertainty measure, quantifying how (un)certain each of these probabilities is. Even though CNN-based classification has achieved some good results in the last few years, as with any data driven method, actual performance heavily depends on the training set. In particular, if the classified object is represented poorly in the training set, the classification result will be unreliable and vary greatly with slightly different NN classifier weights. This variation is referred to as model uncertainty. High model uncertainty tends to arise from input that is far from the NN classifier's training set, which could be caused by an object not being in the training set or by occlusions. In addition, classification in which each frame is treated separately is influenced by environmental conditions such as lighting and occlusions. Consequently, it can provide unstable classification results.

Various methods have been proposed to compute model uncertainty from a single image, the disclosures of which are hereby incorporated by reference, such as: Yarin Gal and Zoubin Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” Intl. Conf. on Machine Learning (ICML), 2016 (hereinbelow, “Gal and Ghahramani”); and Pavel Myshkov and Simon Julier, “Posterior distribution analysis for Bayesian inference in neural networks,” Advances in Neural Information Processing Systems (NIPS), 2016. To address the instability of single-frame classification, various Bayesian sequential classification algorithms that maintain a posterior class distribution were developed. These include the following, the disclosures of which are hereby incorporated by reference: W. T. Teacy, et al., “Observation modeling for vision-based target search by unmanned aerial vehicles,” Intl. Conf. on Autonomous Agents and Multiagent Systems (AAMAS), pp. 1607-1614, 2015; Javier Velez, et al., “Modeling observation correlations for active exploration and robust object detection,” J. of Artificial Intelligence Research, 2012; T. Patten, et al., “Viewpoint evaluation for online 3-d active object classification,” IEEE Robotics and Automation Letters (RA-L), 1(1):73-81, January 2016.

Methods have also been developed for computing model uncertainty for deep learning applications. A normalized entropy of class probability may be used as a measure of classification uncertainty, as described by Grimmett et al., “Introspective classification for robot perception,” Intl. J. of Robotics Research, 35(7):743-762, 2016, whose disclosures are incorporated herein by reference. However, none of the sequential classification approaches addresses model uncertainty. Crucially, while the posterior class distribution fuses all classifier outputs thus far, it does not provide any indication of how reliable the posterior classification is. In Bayesian inference over continuous random variables (e.g. the SLAM problem), this would correspond to obtaining the maximum a posteriori solution without providing the uncertainty covariances. Clearly, this is highly undesirable, in particular in the context of safe autonomous decision making (e.g. in robotics, or for self-driving cars), where a key question is when a decision should be made given the data available thus far. (See, for example, Indelman, et al., “Incremental distributed inference from arbitrary poses and unknown data association: Using collaborating robots to establish a common reference,” IEEE Control Systems Magazine (CSM), Special Issue on Distributed Control and Estimation for Robotic Vehicle Networks, 36(2):41-74, 2016, the disclosures of which are hereby incorporated by reference.)

On the other hand, existing approaches that account for model uncertainty do not consider sequential classification. As a consequence, none of the existing approaches reason about the posterior uncertainty, given images previously acquired. To draw conclusions about uncertainty in posterior classification, it would be useful to maintain a distribution over posterior class probabilities while accounting for model uncertainty.

SUMMARY OF THE INVENTION

Embodiments of the present invention provide methods and systems for classifying an object appearing in multiple sequential images, by a process including: determining a neural network (NN) classifier having multiple object classes for classifying objects in images; determining a likelihood classifier model comprising a likelihood vector of class probability vectors; for each image z, running the image multiple respective times through the NN classifier, applying dropout each time, to generate a point cloud of class probability vector values {γ_(t)}; calculating a vector of posterior distributions {λ_(t)} for each class and for each of the multiple {γ_(t)}, where calculating each class element of {λ_(t)} includes calculating a product of the respective element of the class probability vectors and an element of the posterior distribution of a prior image; randomly selecting a subset of {λ_(t)} to form a new subset of {λ_(t)}; and repeating the calculation of the subset {λ_(t)} for each of the images, to determine a cloud of posterior probability vectors approximating a distribution over posterior class probabilities, given all the multiple sequential images.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of the invention, reference is made to the following description and accompanying drawings, in which:

FIGS. 1a-g illustrate examples for inference of a posterior class distribution, $\mathbb{P}(\lambda_k \mid z_{1:k})$, from $\mathbb{P}(\gamma_k \mid z_k)$ and $\mathbb{P}(\lambda_{k-1} \mid z_{1:k-1})$ using a known classifier model, considering three possible classes, according to embodiments of the present invention;

FIGS. 2a-d illustrate a case where posterior uncertainty grows with each additional image viewed, according to embodiments of the present invention;

FIGS. 3a-c illustrate probabilities of a classifier likelihood model for three classes, and FIGS. 3d-f illustrate classification point clouds for three images, according to embodiments of the present invention;

FIGS. 4a-d present results in terms of expectation $\mathbb{E}(\lambda_k^i)$ and $\sqrt{\mathrm{Var}(\lambda_k^i)}$ for each of three classes, as a function of classifier measurements, according to embodiments of the present invention;

FIGS. 5a-c present the development of $\{\lambda_k\}$ point clouds showing the spread of points at different time steps, according to embodiments of the present invention;

FIGS. 6a-d present four of the dataset images, exhibiting occlusions, blur, and different colored filters in a monotone environment, according to embodiments of the present invention;

FIGS. 7a-f present the simplex representations of the classifier model per class, and a normalized simplex of classifier outputs for three high probability classes, according to embodiments of the present invention;

FIGS. 8a-d show the classification results for all the methods presented, according to embodiments of the present invention;

FIGS. 9a and 9b present the computational time comparison between methods of inference with and without sub-sampling, according to embodiments of the present invention; and

FIG. 10 is a listing of pseudo-code of a process for determining a point cloud $\{\lambda_t\}$ that approximates a distribution over posterior class probabilities for time t (i.e. $\mathbb{P}(\lambda_t \mid z_{1:t})$), according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments of the present invention provide methods for inferring a distribution over posterior class probabilities with a measure of uncertainty using a deep learning NN classifier. As opposed to prior methods, the approach disclosed herein facilitates quantification of uncertainty in posterior classification given all historical observations, and as such facilitates robust classification, object-level perception and safe autonomy. In particular, we provide a current posterior class probability vector that is a function of a previous posterior class probability vector, accounting for model uncertainty. We use a sub-sampling approximation to obtain a point cloud that approximates the function's distribution. Our approach was studied both in simulation and with real images fed into a deep learning classifier, providing a classification posterior along with uncertainty estimates for each time instant.

Problem Formulation

Consider a robot observing a single object from multiple viewpoints, aiming to infer its class while quantifying uncertainty in the latter. Each class probability vector is $\gamma_k \triangleq [\gamma_k^1 \ldots \gamma_k^i \ldots \gamma_k^M]$, where M is the number of candidate classes. Each element $\gamma_k^i$ is the probability of object class c being i given image $z_k$, i.e. $\gamma_k^i \equiv \mathbb{P}(c=i \mid z_k)$, while $\gamma_k$ resides in the (M−1) simplex such that

$\begin{matrix}{\gamma_k^i \geq 0, \quad \|\gamma_k\|_1 = 1.} & (1)\end{matrix}$

Existing Bayesian sequential classification approaches do not consider model uncertainty, and thus maintain a posterior distribution $\lambda_k$ for time k over c,

$\begin{matrix}{\lambda_k \triangleq \mathbb{P}(c \mid \gamma_{1:k}),} & (2)\end{matrix}$

given history $\gamma_{1:k}$ obtained from images $z_{1:k}$. In other words, $\lambda_k$ is inferred from a single sequence of $\gamma_{1:k}$, where each $\gamma_t$ for $t \in [1, k]$ corresponds to an input image $z_t$. However, the posterior class probability $\lambda_k$ by itself does not provide any information regarding how reliable the classification result is due to model uncertainty. For example, a classifier output $\gamma_k$ may have a high score for a certain class, but if the input is far from the classifier training set the result is not reliable and may vary greatly with small changes in the scenario and classifier weights.

Embodiments of the present invention quantify model uncertainty, i.e. quantify how “far” an image input $z_t$ is from a training set D by modeling the distribution $\mathbb{P}(\gamma_t \mid z_t, D)$. Given a training set D and classifier weights w, the output $\gamma_t$ is a deterministic function of input $z_t$ for all $t \in [1, k]$:

$\begin{matrix}{\gamma_t = f_w(z_t),} & (3)\end{matrix}$

where the function $f_w$ is a classifier with weights w. However, w are stochastic given D, thus inducing a probability $\mathbb{P}(w \mid D)$ and making $\gamma_t$ a random variable. Gal and Ghahramani showed that an input far from the training set will produce vastly different classifier outputs for small changes in weights. Unfortunately, $\mathbb{P}(w \mid D)$ is not given explicitly. To address this issue, Gal and Ghahramani proposed to approximate $\mathbb{P}(w \mid D)$ via dropout, i.e. sampling w from another distribution that is closest to $\mathbb{P}(w \mid D)$ in the sense of KL divergence. Practically, an input image $z_t$ is run through an NN classifier with dropout multiple times to get many different $\gamma_t$'s for corresponding w realizations, creating a point cloud of class probability vectors. Note that every distribution described herein is dependent on the training set D. This reference to D is omitted in the equations below.
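The repeated dropout passes described above can be sketched in a few lines. This is only an illustration under stated assumptions: the single-layer softmax classifier, the weight matrix `toy_weights`, the feature vector `toy_features` and the helper names are all hypothetical stand-ins for a real CNN, and dropout is applied to the input features, which is equivalent to zeroing the matching weight columns.

```python
import math
import random

def softmax(logits):
    """Map raw scores to a probability vector on the simplex."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dropout_forward(features, weights, drop_prob, rng):
    """One stochastic forward pass of a toy single-layer classifier.

    Each input feature is zeroed with probability drop_prob and the
    survivors are rescaled (inverted dropout). Every call therefore
    corresponds to one sampled realization of the weights w.
    """
    keep = 1.0 - drop_prob
    masked = [x / keep if rng.random() < keep else 0.0 for x in features]
    logits = [sum(w * x for w, x in zip(row, masked)) for row in weights]
    return softmax(logits)

def gamma_point_cloud(features, weights, n_passes, drop_prob=0.5, seed=0):
    """Repeat the dropout pass n_passes times to build the cloud {gamma_t}."""
    rng = random.Random(seed)
    return [dropout_forward(features, weights, drop_prob, rng)
            for _ in range(n_passes)]

# Hypothetical 3-class classifier acting on a 4-dimensional feature vector.
toy_weights = [[1.0, 0.2, -0.5, 0.3],
               [0.1, 0.9, 0.4, -0.2],
               [-0.3, 0.1, 0.8, 0.6]]
toy_features = [0.7, 0.1, 0.4, 0.9]
cloud = gamma_point_cloud(toy_features, toy_weights, n_passes=10)
```

Each element of `cloud` is one class probability vector satisfying Eq. (1); the scatter of the ten vectors plays the role of the point cloud of class probability vectors for a single image.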

Hereinbelow, a class-dependent likelihood $\mathbb{L}_i(\gamma_k) \triangleq \mathbb{P}(\gamma_k \mid c=i)$, referred to as a likelihood classifier model, is utilized. This likelihood classifier model is a likelihood vector denoted as $\mathbb{L}(\gamma_k) \triangleq [\mathbb{L}_1(\gamma_k) \ldots \mathbb{L}_M(\gamma_k)]$. (An uninformative prior $\mathbb{P}(c=i)=1/M$ is assumed.) The likelihood classifier model is based on a Dirichlet distributed classifier model with a different hyperparameter vector $\theta_i \in \mathbb{R}^{M \times 1}$ per class $i \in [1, M]$, such that $\mathbb{P}(\gamma_k \mid c=i)$ may be written as:

$\begin{matrix}{\mathbb{L}_i(\gamma_k) = \mathrm{Dir}(\gamma_k; \theta_i).} & (4)\end{matrix}$

The Dirichlet distribution is the conjugate prior of a categorical distribution, and therefore supports class probability vectors, particularly $\gamma_k$. Samples from a Dirichlet distribution necessarily satisfy the conditions of Eq. (1), unlike samples from other distributions such as a Gaussian. The probability density function (PDF) of the above distribution is as follows:

$\begin{matrix}{{{{\mathbb{L}}_{i}\left( \gamma_{k} \right)} = {{C\left( \theta_{i} \right)}{\prod\limits_{j = 1}^{M}\left( \gamma_{k}^{j} \right)^{\theta_{i}^{j} - 1}}}},} & (5)\end{matrix}$

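where $C(\theta_i)$ is a normalizing constant dependent on $\theta_i$, and $\theta_i^j$ is the j-th element of vector $\theta_i$. The following shorthand is used for the class-conditional likelihood:

$\begin{matrix}{\mathbb{P}(\gamma_k \mid c=i) \triangleq \mathbb{L}_i(\gamma_k), \quad \mathbb{P}(\cdot \mid c=i) \triangleq \mathbb{L}_i.} & (6)\end{matrix}$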
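As an illustration of Eq. (5), the density can be evaluated in log form with the standard Dirichlet normalizing constant $C(\theta_i) = \Gamma(\sum_j \theta_i^j) / \prod_j \Gamma(\theta_i^j)$. The sketch below is a minimal example, not the implementation used in the embodiments; the function names and the example hyperparameters are hypothetical, and every entry of `gamma` is assumed to be strictly positive.

```python
import math

def dirichlet_log_pdf(gamma, theta):
    """Log of Eq. (5): log C(theta) + sum_j (theta_j - 1) * log(gamma_j).

    Assumes all entries of gamma are strictly positive.
    """
    if len(gamma) != len(theta):
        raise ValueError("gamma and theta must have the same length")
    log_norm = math.lgamma(sum(theta)) - sum(math.lgamma(t) for t in theta)
    return log_norm + sum((t - 1.0) * math.log(g) for g, t in zip(gamma, theta))

def likelihood_vector(gamma, thetas):
    """Return [L_1(gamma), ..., L_M(gamma)], one entry per class hypothesis."""
    return [math.exp(dirichlet_log_pdf(gamma, theta)) for theta in thetas]

# Hypothetical hyperparameters: a flat Dirichlet and a centered one.
example_thetas = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
example_gamma = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
example_likelihoods = likelihood_vector(example_gamma, example_thetas)
```

For the flat hyperparameters the density is constant at $\Gamma(3) = 2$ over the whole simplex, which gives a quick sanity check on the normalizing constant.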

The likelihood classifier model $\mathbb{L}_i(\gamma_k)$ must be distinguished from the model uncertainty derived from $\mathbb{P}(\gamma_k \mid z_k)$ for class i and time step k. The likelihood classifier model $\mathbb{L}_i(\gamma_k)$ is the likelihood of a single $\gamma_k$ given a class hypothesis i. The hyperparameters $\theta_i^j$ of the model are inferred (i.e., computed) prior to the scenario for each class from the training set, and these parameters are taken as constant within the scenario. Methods for computing the hyperparameters are described in section 3 of J. Huang, “Maximum likelihood estimation of Dirichlet distribution parameters,” CMU Technical Report, 2005. By contrast, $\mathbb{P}(\gamma_k \mid z_k)$ is the probability of $\gamma_k$ given an image $z_k$, and is computed during the scenario. Note that if the true object class is i and it is “close” to the training set, the probabilities $\mathbb{P}(\gamma_k \mid z_k)$ and $\mathbb{L}_i(\gamma_k)$ will be “close” to each other as well.

A key observation is that $\lambda_k$ is a random variable, as it depends on $\gamma_{1:k}$ (see Eq. (2)) while each $\gamma_t$, with $t \in [1, k]$, is a random variable distributed according to $\mathbb{P}(\gamma_t \mid z_t, D)$. Thus, rather than maintaining the posterior Eq. (2), our goal is to maintain a distribution over posterior class probabilities for time k, i.e.

$\begin{matrix}{\mathbb{P}(\lambda_k \mid z_{1:k}).} & (7)\end{matrix}$

This distribution permits the calculation of the posterior class distribution, $\mathbb{P}(c \mid z_{1:k})$, via expectation

$\begin{matrix}{{{{\mathbb{P}}\left( {c = \left. i \middle| z_{1:k} \right.} \right)} = {{\int_{\lambda_{k}^{i}}{{{\mathbb{P}}\left( {{c = \left. i \middle| \lambda_{k}^{i} \right.},z_{1:k}} \right)}{{\mathbb{P}}\left( \lambda_{k}^{i} \middle| z_{1:k} \right)}d\lambda_{k}^{i}}} = {{\int_{\lambda_{k}^{i}}{{{\mathbb{P}}\left( {c = \left. i \middle| \lambda_{k}^{i} \right.} \right)}{{\mathbb{P}}\left( \lambda_{k}^{i} \middle| z_{1:k} \right)}d\lambda_{k}^{i}}} = {{\mathbb{E}}\left\lbrack \lambda_{k}^{i} \right\rbrack}}}},} & (8)\end{matrix}$

based on the identity $\mathbb{P}(c=i \mid \lambda_k^i) = \lambda_k^i$.

Moreover, as will be seen, Eq. (7) allows the posterior uncertainty to be quantified, thereby providing a measure of confidence in the classification result given all data thus far.

Here, it is useful to summarize our assumptions:

1. A single object is observed multiple times.
2. $\mathbb{P}(\gamma_t \mid z_t, D)$ is approximated by a point cloud $\{\gamma_t\}$ for each image $z_t$.
3. An uninformative prior is assumed for $\mathbb{P}(c=i)$.
4. A Dirichlet distributed classifier model is used, with designated parameters for each class $c \in [1, \ldots, M]$. These parameters are constant and given (e.g. learned).

Approach

We aim to find a distribution over the posterior class probability vector $\lambda_k$ for time k, i.e. $\mathbb{P}(\lambda_k \mid z_{1:k})$. First, $\lambda_k$ is expressed given some specific sequence $\gamma_{1:k}$. Using Bayes' law:

$\begin{matrix}{\lambda_k^i = \mathbb{P}(c=i \mid \gamma_{1:k}) \propto \mathbb{P}(c=i \mid \gamma_{1:k-1})\,\mathbb{P}(\gamma_k \mid c=i, \gamma_{1:k-1}).} & (9)\end{matrix}$

We assume, for simplicity, that NN classifier outputs are statistically independent. (Hereinbelow, viewpoint-dependent classifier models are not applied, and the outputs $\gamma_{1:k}$ are assumed to be statistically independent of each other.) We can then re-write Eq. (9) as

$\begin{matrix}{\lambda_k^i \propto \mathbb{P}(c=i \mid \gamma_{1:k-1})\,\mathbb{P}(\gamma_k \mid c=i).} & (10)\end{matrix}$

Per the definition of $\lambda_{k-1}$ (Eq. (2)) and $\mathbb{P}(\gamma_k \mid c=i)$ (Eq. (6)), $\lambda_k^i$ assumes the following recursive form:

$\begin{matrix}{\lambda_k^i \propto \lambda_{k-1}^i\,\mathbb{L}_i(\gamma_k).} & (11)\end{matrix}$

Given that $\gamma_t$ (for each time step $t \in [1, k]$) is a random variable, $\lambda_{k-1}^i$ and $\lambda_k^i$ are also random variables. Thus, our problem is to infer $\mathbb{P}(\lambda_k \mid z_{1:k})$, where, according to Eq. (11), for each realization of the sequence $\gamma_{1:k}$, $\lambda_k$ is a function of $\lambda_{k-1}$ and $\gamma_k$.

The approach is shown as Algorithm 1 of FIG. 10. At each time step t, a new image $z_t$ is classified using multiple forward passes through a CNN with dropout, yielding a point cloud $\{\gamma_t\}$. Each forward pass gives a probability vector $\gamma_t \in \{\gamma_t\}$, which is used to compute the class likelihood vector $\mathbb{L}(\gamma_t)$ under the Dirichlet classifier model. In addition, $\{\lambda_{t-1}\}$ is a point cloud (i.e., set of elements) from the previous step. All possible pairs of $\lambda_{t-1}^i$ and $\mathbb{L}_i(\gamma_t)$ are multiplied, as in Eq. (11). Finally, $N_{ss,n}$ pairs are chosen for the next step, using a sub-sampling algorithm that is detailed hereinbelow. This results in a point cloud $\{\lambda_t\}$ that approximates $\mathbb{P}(\lambda_t \mid z_{1:t})$.

The algorithm must be initialized for the first image. Recalling Eq. (2), $\lambda_1^i$ (first image) is defined for class i and time k=1 as:

$\begin{matrix}{\lambda_1^i \triangleq \mathbb{P}(c=i \mid \gamma_1).} & (12)\end{matrix}$

Using Bayes' law:

$\begin{matrix}{{{\mathbb{P}}\left( {c = \left. i \middle| \gamma_{1} \right.} \right)} = \frac{{{\mathbb{P}}\left( {\left. \gamma_{1} \middle| c \right. = i} \right)}{{\mathbb{P}}\left( {c = i} \right)}}{{\mathbb{P}}\left( \gamma_{1} \right)}} & (13)\end{matrix}$

where $\mathbb{P}(c=i)$ is a prior probability of class i, $\mathbb{P}(\gamma_1)$ serves as a normalizing term, and $\mathbb{P}(\gamma_1 \mid c=i)$ is the classifier model for class i. Per definition Eq. (6), Eq. (13) can be written as:

$\begin{matrix}{\lambda_1^i \propto \mathbb{P}(c=i)\,\mathbb{L}_i(\gamma_1),} & (14)\end{matrix}$

thus $\lambda_1^i$ is a function of the prior $\mathbb{P}(c=i)$ and $\gamma_1$, and in the subsequent steps the update rule of Eq. (11) can be used to infer $\mathbb{P}(\lambda_k \mid z_{1:k})$.

It should be noted that there is a numerical issue where $\lambda_k^i$ for sufficiently large k can practically become 0 or 1, preventing any possible change in future time steps. In embodiments of the present invention, this is overcome by calculating $\log \lambda_k^i$ instead of $\lambda_k^i$.
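One possible way to carry out the recursion of Eq. (11) on log values is sketched below. The helper names, the log-sum-exp normalization and the per-step likelihood values are assumptions made for illustration; the source only states that the logarithm of the posterior is computed.

```python
import math

def log_normalize(log_values):
    """Subtract log-sum-exp so that the exponentiated values sum to one."""
    m = max(log_values)
    log_total = m + math.log(sum(math.exp(v - m) for v in log_values))
    return [v - log_total for v in log_values]

def log_posterior_update(log_lambda_prev, log_likelihood):
    """Eq. (11) in log space: add the log-likelihood, then renormalize."""
    return log_normalize([a + b for a, b in zip(log_lambda_prev, log_likelihood)])

# Uniform prior over three classes, as in Eq. (14) with P(c=i) = 1/M.
log_lambda = [math.log(1.0 / 3.0)] * 3

# Hypothetical likelihood values L_i(gamma_k), reused for 2000 time steps.
step_likelihood = [1e-3, 2e-3, 1.5e-3]
step_log_likelihood = [math.log(v) for v in step_likelihood]
for _ in range(2000):
    log_lambda = log_posterior_update(log_lambda, step_log_likelihood)
```

After the loop, the entries for the two less likely classes are large negative numbers rather than exact zeros, so a later image that contradicts the history can still shift the posterior.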

In the next section the properties of $\mathbb{P}(\lambda_k \mid z_{1:k})$ are reviewed, as well as the corresponding posterior uncertainty versus time. Two inference approaches that approximate this PDF are presented.

Inference Over the Posterior $\mathbb{P}(\lambda_k \mid z_{1:k})$

In this section the distribution $\mathbb{P}(\lambda_k \mid z_{1:k})$ is analyzed to provide an inference method to track this distribution over time. As discussed above, all $\gamma_t$ are random variables; hence, according to Eq. (11), $\mathbb{P}(\lambda_k \mid z_{1:k})$ accumulates all model uncertainty data from all $\mathbb{P}(\gamma_t \mid z_t)$ up until time step k, with $t \in [1, k]$.

FIGS. 1a-g illustrate examples for inference of $\mathbb{P}(\lambda_k \mid z_{1:k})$ from $\mathbb{P}(\gamma_k \mid z_k)$ and $\mathbb{P}(\lambda_{k-1} \mid z_{1:k-1})$ using a known classifier model, considering three possible classes. FIGS. 1a-c present example distributions for the classifier model. FIG. 1d presents a point cloud that describes the distribution of $\lambda_{k-1}$. FIG. 1e presents $\mathbb{P}(\gamma_k \mid z_k)$ represented by a point cloud of $\gamma_k$ instances. Each $\gamma_k$ is projected via $\mathbb{L}(\gamma_k)$ to a different point in the simplex, yielding the cloud presented in FIG. 1f. Finally, based on Eq. (11), the multiplication of points from FIGS. 1d and 1f creates a $\{\lambda_k\}$ point cloud, shown in FIG. 1g. In the presented scenario, the spread of the $\{\lambda_k\}$ point cloud (FIG. 1g) was smaller than the spread of $\{\lambda_{k-1}\}$ (FIG. 1d), because both point clouds $\{\lambda_{k-1}\}$ and $\{\mathbb{L}(\gamma_k)\}$ are near the same simplex edge. In general, classifier models with large parameters (see Eq. (5)) create $\{\mathbb{L}(\gamma_t)\}$ point clouds that are closer to the simplex edge. In turn, the $\{\lambda_k\}$ point cloud (updated via Eq. (11)) will converge faster to a single simplex edge.

The graphs of FIG. 1 thus illustrate the inference process of $\mathbb{P}(\lambda_k \mid z_{1:k})$. FIGS. 1a-c show the $\mathbb{L}_i$ classifier model for classes 1, 2 and 3, respectively, with higher probability zones presented in yellow. FIG. 1d shows the distribution of $\lambda_{k-1}$ from the previous step. Note that for k=1, $\lambda_0$ is given by the prior $\mathbb{P}(c)$. FIG. 1e shows a point cloud $\{\gamma_k\}$ approximating $\mathbb{P}(\gamma_k \mid z_k)$ via multiple forward passes of the (CNN) classifier with dropout, given a new measurement $z_k$ (an image) at current time step k. FIG. 1f shows the corresponding likelihood $\mathbb{L}(\gamma_k)$ for each $\gamma_k \in \{\gamma_k\}$ from FIG. 1e. Finally, multiplying $\lambda_{k-1}$ and $\mathbb{L}(\gamma_k)$ (Eq. (11)) results in the point cloud shown in FIG. 1g, representing a distribution over $\lambda_k$. The spread of $\lambda_k$ is smaller in this case than that of $\lambda_{k-1}$, as both $\mathbb{L}(\gamma_k)$ and $\mathbb{P}(\lambda_{k-1} \mid z_{1:k-1})$ are close to the same simplex corner.

As shown in the graphs, the spread of $\{\lambda_k\}$ is indicative of accumulated model uncertainty, and is dependent on the expectation and spread of both $\{\lambda_{k-1}\}$ and $\{\gamma_k\}$. For specific realizations of $\lambda_{k-1}$ and $\gamma_k$, as seen in Eq. (11), $\lambda_k^i$ is a multiplication of $\lambda_{k-1}^i$ and $\mathbb{L}_i(\gamma_k)$. Therefore, when $\mathbb{L}(\gamma_k)$ is at the simplex center, i.e. $\mathbb{L}_i(\gamma_k) = \mathbb{L}_j(\gamma_k)$ for all i, j = 1, . . . , M, the resulting $\lambda_k$ will be equal to $\lambda_{k-1}$. On the other hand, when $\mathbb{L}(\gamma_k)$ is at one of the simplex edges, its effect on $\lambda_k$ will be the greatest. Expanding to the probability $\mathbb{P}(\lambda_k \mid z_{1:k})$, there are several cases to consider. If $\mathbb{P}(\lambda_{k-1} \mid z_{1:k-1})$ and $\{\mathbb{L}(\gamma_k)\}$ “agree” with each other, i.e. the highest probability class is the same, and both are far enough from the simplex center, the resulting $\mathbb{P}(\lambda_k \mid z_{1:k})$ will have a smaller spread compared to $\mathbb{P}(\lambda_{k-1} \mid z_{1:k-1})$ and its expectation will have the dominant class with a high probability. On the other hand, if $\mathbb{P}(\lambda_{k-1} \mid z_{1:k-1})$ and $\{\mathbb{L}(\gamma_k)\}$ “disagree” with each other, i.e. they are close to different simplex corners, the spread of $\mathbb{P}(\lambda_k \mid z_{1:k})$ will become larger; an example for this case is illustrated in FIG. 2. In practice such a scenario can occur when an object of a certain class is observed from a viewpoint where it appears like a different class. If both $\mathbb{P}(\lambda_{k-1} \mid z_{1:k-1})$ and $\{\mathbb{L}(\gamma_k)\}$ are near the simplex center, the spread of $\mathbb{P}(\lambda_k \mid z_{1:k})$ will increase as well. Finally, if only one of $\mathbb{P}(\lambda_{k-1} \mid z_{1:k-1})$ and $\{\mathbb{L}(\gamma_k)\}$ is near the simplex center, $\mathbb{P}(\lambda_k \mid z_{1:k})$ will be similar to the one that is farther from the simplex center.

As described above, the graphs of FIGS. 2a-d illustrate a case where the posterior uncertainty grows with an additional image. The classifier model is the same as in FIG. 1, as are the inference steps. FIG. 2a represents $\mathbb{P}(\lambda_{k-1} \mid z_{1:k-1})$. In FIG. 2b the point cloud $\{\gamma_k\}$ is closer to class 3, compared to the $\{\lambda_{k-1}\}$ cloud from FIG. 2a, which is closer to class 1. The classifier model translates $\gamma_k$ into $\mathbb{L}(\gamma_k)$ in FIG. 2c, projecting the point cloud around class 3, and thus after the multiplication shown in FIG. 2d, the distribution is more spread out compared to FIG. 2a.

From $\mathbb{P}(\lambda_k \mid z_{1:k})$ the expectation $\mathbb{E}(\lambda_k)$ (computed as in Eq. (8)) and covariance matrix $\mathrm{Cov}(\lambda_k)$ of $\lambda_k$ may be calculated. $\mathbb{E}(\lambda_k)$ takes into account model uncertainty from each image, unlike existing approaches (e.g. Omidshafiei, et al., “Hierarchical Bayesian noise inference for robust real-time probabilistic object classification,” preprint arXiv:1605.01042, 2016). Consequently, we achieve a posterior classification that is more resistant to possible aliasing. The covariance matrix $\mathrm{Cov}(\lambda_k)$ represents the spread of $\lambda_k$, and in turn accumulates the model uncertainty from all images $z_{1:k}$. In general, lower $\mathrm{Cov}(\lambda_k)$ values represent smaller $\lambda_k$ spread, and thus higher confidence in the classification results. Practically, this can be used in a decision making context, where higher confidence answers are preferred. For example, values of $\mathrm{Var}(\lambda_k^i)$ for all classes i = 1, . . . , M may be compared, as a means of describing the uncertainty per class.
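Given a point cloud that approximates the distribution over $\lambda_k$, the per-class expectation and variance can be estimated by simple sample statistics. The sketch below is one way to do so; the function name and the example cloud are hypothetical, and only the diagonal of the covariance matrix (the per-class variances) is computed.

```python
def cloud_mean_and_variance(cloud):
    """Per-class sample mean and population variance of a lambda point cloud."""
    n = len(cloud)
    n_classes = len(cloud[0])
    mean = [sum(p[i] for p in cloud) / n for i in range(n_classes)]
    var = [sum((p[i] - mean[i]) ** 2 for p in cloud) / n
           for i in range(n_classes)]
    return mean, var

# Hypothetical cloud of three posterior vectors over three classes.
example_cloud = [[0.7, 0.2, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.8, 0.1, 0.1]]
mean, var = cloud_mean_and_variance(example_cloud)
std = [v ** 0.5 for v in var]
```

Here `mean` plays the role of the posterior class probabilities obtained via Eq. (8), and `std` gives a per-class uncertainty that can be compared across classes.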

Furthermore, there is a correlation between the expectation $\mathbb{E}(\lambda_k)$ and $\mathrm{Cov}(\lambda_k)$. The largest covariance values will occur when $\mathbb{E}(\lambda_k)$ is at the simplex center. In particular, it is not difficult to show that the highest possible value for $\mathrm{Var}(\lambda_k^i)$ for any i is 0.25; it can occur only when $\mathbb{E}(\lambda_k^i) = 0.5$. In general, if $\mathbb{E}(\lambda_k)$ is close to the simplex boundaries, the uncertainty is lower. Therefore, to reduce uncertainty, $\mathbb{E}(\lambda_k)$ should be concentrated in a single high probability class.
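The 0.25 bound can be verified with a short argument that uses only the fact that each $\lambda_k^i$ lies in [0, 1]. Writing $\mu = \mathbb{E}(\lambda_k^i)$ (a symbol introduced here for brevity), and noting that $(\lambda_k^i)^2 \leq \lambda_k^i$ on [0, 1]:

$\mathrm{Var}(\lambda_k^i) = \mathbb{E}\left[(\lambda_k^i)^2\right] - \mu^2 \leq \mu - \mu^2 = \mu(1-\mu) \leq \tfrac{1}{4}.$

The last inequality is tight only for $\mu = 0.5$, and the first only when $\lambda_k^i$ takes the values 0 and 1 alone, so the bound is reached by a cloud split evenly between the two extremes. The same expression $\mu(1-\mu)$ shows why the variance must shrink as the expectation approaches 0 or 1.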

The distribution $\mathbb{P}(\lambda_k \mid z_{1:k})$, where the expression for $\lambda_k$ is described in Eq. (11), has no known analytical solution. The next most accurate method available is multiplying all possible combinations of points from the point clouds $\{\gamma_t\}$, for all images at times $t \in [1, k]$. This method is computationally intractable, as the number of $\lambda_k$ points grows exponentially. The next section provides a simple sub-sampling method to approximate this distribution while keeping the computation tractable.

Sub-Sampling Inference

As mentioned above, for each measurement, a “cloud” (i.e., a set) of $N_k$ probability vectors $\{(\gamma_k)^n\}_{n=1}^{N_k}$ is generated. Each probability vector is projected via the classifier model to a different point within the simplex, which provides a new point cloud $\{\mathbb{L}(\gamma_k)^n\}_{n=1}^{N_k}$. We assume that $\mathbb{P}(\lambda_{k-1} \mid z_{1:k-1})$ is described by a cloud of $N_{k-1}$ points. Given the data for $\gamma_k$ and $\lambda_{k-1}$, the most accurate approximation to $\mathbb{P}(\lambda_k \mid z_{1:k})$ is given by multiplying all possible pairs of $\lambda_{k-1}$ and $\mathbb{L}(\gamma_k)$. Thus, $\mathbb{P}(\lambda_k \mid z_{1:k})$ is described with a cloud of $N_{k-1} \times N_k$ points. For subsequent steps the cloud size grows exponentially, making it computationally intractable. We address this problem by randomly sampling from the point cloud for $\lambda_k$ a subset of $N_{ss,n}$ points and using them for the next time step. In practice, $N_{ss,n}$ may be kept constant across all time steps, as indicated in line 16 in Algorithm 1.
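A single time step of the all-pairs update followed by sub-sampling might look as follows. This is a sketch under assumptions rather than a transcription of Algorithm 1: the function names and example clouds are hypothetical, the subset is drawn uniformly without replacement, and plain products are used instead of logarithms to keep the example short.

```python
import random

def normalize(vec):
    """Rescale a non-negative vector so that its entries sum to one."""
    total = sum(vec)
    return [v / total for v in vec]

def update_and_subsample(lambda_cloud, likelihood_cloud, n_keep, rng):
    """All-pairs update via Eq. (11), then uniform sub-sampling to n_keep points."""
    candidates = [normalize([l * q for l, q in zip(lam, lik)])
                  for lam in lambda_cloud
                  for lik in likelihood_cloud]
    if len(candidates) <= n_keep:
        return candidates
    return rng.sample(candidates, n_keep)

# Hypothetical previous cloud {lambda_{k-1}} with 4 points.
prev_cloud = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2],
              [0.4, 0.4, 0.2], [0.7, 0.2, 0.1]]
# Hypothetical likelihood vectors L(gamma_k) for 5 dropout passes.
lik_cloud = [[1.0, 2.0, 0.5], [0.8, 1.5, 0.7], [1.2, 1.0, 1.0],
             [0.9, 2.2, 0.4], [1.1, 1.8, 0.6]]
new_cloud = update_and_subsample(prev_cloud, lik_cloud, n_keep=6,
                                 rng=random.Random(0))
```

The 4 × 5 = 20 candidate points are reduced to 6, so the cloud size passed to the next time step stays bounded regardless of how many images have been processed.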

Experiments

In this section we present results of our method using real images fed into an AlexNet CNN classifier (as described by Krizhevsky, et al., “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, pages 1097-1105, 2012). We used a PyTorch implementation of AlexNet for classification, and Matlab for sequential data fusion. The system ran on an Intel i7-7700HQ CPU running at 2.8 GHz, and 16 GB of RAM. We compare four different approaches:

1. Method-$\mathbb{P}(c \mid z_{1:k})$-w/o-model: Naive Bayes that infers the posterior of $\mathbb{P}(c \mid z_{1:k})$ where the classifier model is not taken into account (SSBF, as described in Omidshafiei, cited above).
2. Method-$\mathbb{P}(c \mid z_{1:k})$-w-model: A Bayesian approach that infers the posterior of $\mathbb{P}(c \mid z_{1:k})$ and uses a classifier model; essentially using Eq. (11) with a known classifier model.
3. Method-$\mathbb{P}(\lambda_k \mid z_{1:k})$-AP: Inference of $\mathbb{P}(\lambda_k \mid z_{1:k})$ multiplying all possible combinations of $\lambda_{k-1}$ and $\mathbb{L}(\gamma_k)$. Note that the number of combinations grows exponentially with k, thus the results are presented up until k=5.
4. Method-$\mathbb{P}(\lambda_k \mid z_{1:k})$-SS: Inference of $\mathbb{P}(\lambda_k \mid z_{1:k})$ using the sub-sampling method.

Embodiments of the present invention are represented by approaches 3 and 4.

Simulated Experiment

A simulated experiment was conducted to demonstrate the performance of embodiments of the present invention. The simulation emulated a scenario of a robot traveling in a predetermined trajectory and observing an object from multiple viewpoints. This object's class was one of three possible candidates. We infer the posterior over λ and display the results as expectation $\mathbb{E}(\lambda_k^i)$ and standard deviation per class i:

$\begin{matrix}{\sigma_i \triangleq \sqrt{\mathrm{Var}(\lambda_k^i)}.} & (15)\end{matrix}$

The simulation demonstrated the effect of using a classifier model in the inference for highly ambiguous measurements. In addition, the uncertainty behavior for the scenario is indicated. A categorical uninformative prior of $\mathbb{P}(c=i)=1/M$ was used for all i = 1, . . . , M.

Each of the three classes has its own (known) classifier model, as shown in FIGS. 3a-c. The classifier model is assumed to be Dirichlet distributed with the following hyperparameters θ_(i) for all i ∈ [1, 3], given in Eq. (16):

θ₁=[6 1 1]

θ₂=[2 7 2]

θ₃=[1 1.5 2].   (16)

In this experiment the true class was 3. The hyperparameters were selected to simulate a case where the γ measurements were spread out (corresponding to ambiguous appearance of the class), thus leading to incorrect classification without a classifier model. The classifier model for this class, $\mathbb{L}_3$, predicts highly variable γ's using the training data (FIG. 3c). The $\{\gamma_t\}$ point clouds for each $t \in [1, k]$ are different from each other (FIG. 3e), representing an object photographed by a robot from multiple viewpoints.

We simulated a series of 5 images. Each image at time step t has its own different $\mathbb{P}(\gamma_t \mid z_t)$. For the approaches that infer $\mathbb{P}(c \mid z_{1:k})$, we sampled a single $\gamma_t$ per image $z_t$ for all $t \in [1, k]$ (FIG. 3f, which also presents the $\gamma_t$ order). This sample simulated the usual single classifier forward pass. For the approaches that infer $\mathbb{P}(\lambda_k \mid z_{1:k})$, ten $\gamma_t$'s from each $\mathbb{P}(\gamma_t \mid z_t)$ were sampled, except for the first step t=1 where 100 $\gamma_1$'s were sampled. For Method-$\mathbb{P}(\lambda_k \mid z_{1:k})$-SS each $\{\lambda_t\}$ point cloud was capped at 100 points. The expectations of these generated measurements are presented in FIG. 3d, along with the cloud order. In FIG. 3e, $\{\gamma_t\}$ point clouds for three different t's are presented in distinct colors. The input for methods 1 and 2 is shown in FIG. 3f, and some of the input for methods 3 and 4 is shown in FIG. 3e.

FIGS. 4a-d present results obtained with our methods, in terms of expectation $\mathbb{E}(\lambda_k^i)$ and $\sqrt{\mathrm{Var}(\lambda_k^i)}$ for each class i, as a function of classifier measurements. FIGS. 4a-c show posterior class probabilities: FIG. 4a shows Method-$\mathbb{P}(c \mid z_{1:k})$-w/o-model; FIG. 4b shows Method-$\mathbb{P}(c \mid z_{1:k})$-w-model; FIG. 4c shows $\mathbb{P}(c \mid z_{1:k})$ calculated via the expectation of Eq. (8) for Method-$\mathbb{P}(\lambda_k \mid z_{1:k})$-SS and Method-$\mathbb{P}(\lambda_k \mid z_{1:k})$-AP; FIG. 4d shows the posterior standard deviation of Eq. (15) for both of our methods.

In FIGS. 4a and 4b we used a single sampled γ_(t) for z_(t) (see FIG. 3f), while in FIGS. 4c and 4d we created a {γ_(t)} point cloud for z_(t) (see FIG. 3e). In FIGS. 4a and 4b results are shown for Method-ℙ(c|z_(1:k))-w/o-model and Method-ℙ(c|z_(1:k))-w-model respectively. Without a classifier model the results generally favor class 2, incorrectly, as the measurements tend to give that class the highest probability. With classifier models the results favor class 3, the correct class. Because the classifier model for class 3 is more spread out than for the other classes, γ's in the middle of the simplex (as in FIG. 3e) have higher ℒ₃(γ) values than ℒ₁(γ) and ℒ₂(γ). While Method-ℙ(c|z_(1:k))-w-model eventually gives correct classification results, it does not account for model uncertainty, i.e. it uses a single classifier output γ obtained with a forward run through the classifier without dropout. In this simulation we sample a single γ from each point cloud to simulate this forward run.
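The contrast between the two single-measurement approaches can be illustrated with a small sketch. Two assumptions are made here and are not taken from the text: the no-model baseline is taken to fuse measurements by multiplying the raw classifier outputs γ_(t)^(i) under a uniform prior, and the Dirichlet hyperparameters and γ values are hypothetical, chosen so that class 2 is always the most likely raw output while class 3 has the most spread-out model.

```python
import math


def dirichlet_logpdf(gamma, theta):
    """Log-density of Dir(gamma; theta) for gamma strictly inside the simplex."""
    log_norm = math.lgamma(sum(theta)) - sum(math.lgamma(a) for a in theta)
    return log_norm + sum((a - 1.0) * math.log(g) for a, g in zip(theta, gamma))


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


# Hypothetical classifier models: classes 1 and 2 are peaked, class 3 is spread out.
thetas = [[20.0, 2.0, 2.0], [2.0, 20.0, 2.0], [2.0, 2.0, 3.0]]
# Hypothetical single classifier outputs, one per image, near the simplex middle.
gammas = [[0.25, 0.40, 0.35], [0.30, 0.38, 0.32], [0.20, 0.42, 0.38]]

without_model = [1.0 / 3] * 3  # uniform prior
with_model = [1.0 / 3] * 3
for gamma in gammas:
    # Assumed no-model baseline: multiply in the raw class probabilities.
    without_model = normalize([g * p for g, p in zip(gamma, without_model)])
    # With a classifier model: multiply in the likelihood of the whole vector.
    lik = [math.exp(dirichlet_logpdf(gamma, th)) for th in thetas]
    with_model = normalize([l * p for l, p in zip(lik, with_model)])

best_without = max(range(3), key=lambda i: without_model[i])  # 0-based index
best_with = max(range(3), key=lambda i: with_model[i])
```

In this toy setting the baseline selects class 2 (index 1), while the model-based update selects class 3 (index 2), because outputs near the simplex middle are far more likely under the spread-out model than under the peaked ones.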

FIGS. 4c and 4d present the expectation and standard deviation, respectively, for the two methods Method-ℙ(λ_(k)|z_(1:k))-SS and Method-ℙ(λ_(k)|z_(1:k))-AP. Throughout the scenario class 3 correctly has the highest probability, and the deviation drops as more measurements are introduced. Compared to FIG. 4b, where class 3 has high probability only at time step t=3, in FIG. 4c class 3 is the most probable from time step t=1. Method-ℙ(λ_(k)|z_(1:k))-SS and Method-ℙ(λ_(k)|z_(1:k))-AP behave similarly. Note that class 1 has a much smaller deviation than the other two because its probability is close to 0 throughout the entire scenario.

FIGS. 5a-c present the development of {λ_(k)} point clouds for Method-ℙ(λ_(k)|z_(1:k))-SS at different time steps. These figures show the gradual decrease in the spread of {λ_(k)}, coinciding with the corresponding standard deviation in FIG. 4d.

Experiment with Real Images

Our method was tested using a series of images of an object (a space heater) that produced conflicting classifier outputs when observed from different viewpoints. This corresponds to a scenario where a robot following a predetermined path observes an object that is affected by occlusions and different lighting conditions. The experiment demonstrates our method's robustness to these classification difficulties, which must be addressed in real-life robotic applications.

The dataset was a series of 10 images of a space heater with artificially induced blur and occlusions. Each of the images was run through an AlexNet convolutional neural network (NN classifier) with 1000 possible classes. As with the simulation described above, we used an uninformative classifier prior ℙ(c) with ℙ(c=i)=1/M for all i=1, . . . , M classes. Our method was used to fuse the classification data into a posterior distribution of the class probability and to infer the deviation for each class. As with the simulation, we generated results with and without a classifier model. FIGS. 6a-d present four of the dataset images, exhibiting occlusions, blur and different colored filters in a monotone environment.

The methods described in the previous sub-sections were implemented as follows. For Method-ℙ(c|z_(1:k))-w/o-model and Method-ℙ(c|z_(1:k))-w-model, each image was run through a neural network (NN) classifier without dropout, and a single output γ was used for each image. For Method-ℙ(λ_(k)|z_(1:k))-SS, each image was run 10 times through the NN classifier with dropout, producing a point cloud {γ} per image. The cap on the number of λ_(k) points for Method-ℙ(λ_(k)|z_(1:k))-SS was 100. For Method-ℙ(λ_(k)|z_(1:k))-AP, results are presented only for the first five images, as the calculations became infeasible due to the exponential complexity.
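The exponential growth mentioned above can be made concrete with a short calculation. The sketch assumes that the cloud starts from a single prior point and that every image contributes the same number of γ samples; the function name is hypothetical.

```python
def cloud_sizes(samples_per_image, num_images, cap=None):
    """Number of lambda points kept after each image (after capping, if any)."""
    sizes = []
    size = 1  # assumed: a single prior point before the first image
    for _ in range(num_images):
        size *= samples_per_image  # every point is paired with every gamma
        if cap is not None:
            size = min(size, cap)  # sub-sampling keeps at most `cap` points
        sizes.append(size)
    return sizes


ap_sizes = cloud_sizes(10, 10)           # all-permutations style growth
ss_sizes = cloud_sizes(10, 10, cap=100)  # sub-sampling with a cap of 100
```

Under these assumptions, keeping all combinations requires 10^5 points after five images and 10^10 after ten, whereas the capped cloud never holds more than 100 points between steps.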

As the AlexNet NN classifier has 1000 possible classes (one of them is “Space Heater”), it is difficult to clearly present results for all of them. Because the goal was to compare the most likely classes, we selected 3 likely classes by averaging all γ outputs of the NN classifier and selecting the three with the highest probability. The probabilities for those classes were then normalized and utilized in the scenario. All classes outside those three were ignored. For each class, we applied a likelihood classifier model. Assuming the likelihood classifier model is Dirichlet distributed, we classified, for each class, multiple images unrelated to the scenario with the same AlexNet NN classifier but without dropout. The classifier produced multiple γ's, one per image, and via a Maximum Likelihood Estimator we inferred the Dirichlet hyperparameters for each class i ∈ [1, 3]. The classifier model ℙ(γ_(k)|c=i)=Dir(γ_(k); θ_(i)) was used with the following hyperparameters θ_(i):

θ₁=[5.103 1.699 1.239]

θ₂=[0.143 208.7 5.31]

θ₃=[0.993 14.31 25.21]  (17)

In this experiment, class 1 is the correct class (i.e. “Space Heater”). FIGS. 7a-f present the simplex representations of the classifier model per class, and a normalized simplex of classifier outputs for three high probability classes, similarly to the graphs in FIG. 3. The classifier model for class 1 is much more spread out than the other two (FIG. 7a), therefore the likelihood of measurements within a larger area will be higher for this class. Interestingly, the classifier model for class 3 predicts that ℙ(γ_(k)|c=3) will lie between classes 2 and 3 (FIG. 7c). FIG. 7e presents 4 of the 10 {γ_(t)} point clouds used in the scenario. FIG. 7d presents the expectation of each {γ_(t)} point cloud for t ∈ [1, 10]. FIG. 7f presents classifier outputs without dropout, i.e. a single γ_(t) per image. Both FIGS. 7d and 7f have indices that represent the image order.
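The effect of the spread of each classifier model can be checked numerically with the hyperparameters of Eq. (17). The sketch below evaluates the Dirichlet log-density of each class model at the simplex center; the choice of the center as the evaluation point and the helper name are assumptions made for illustration only.

```python
import math


def dirichlet_logpdf(gamma, theta):
    """Log-density of Dir(gamma; theta) for gamma strictly inside the simplex."""
    log_norm = math.lgamma(sum(theta)) - sum(math.lgamma(a) for a in theta)
    return log_norm + sum((a - 1.0) * math.log(g) for a, g in zip(theta, gamma))


# Hyperparameters from Eq. (17), keyed by class index.
thetas = {
    1: [5.103, 1.699, 1.239],
    2: [0.143, 208.7, 5.31],
    3: [0.993, 14.31, 25.21],
}

center = [1.0 / 3, 1.0 / 3, 1.0 / 3]  # an ambiguous classifier output
log_lik = {c: dirichlet_logpdf(center, th) for c, th in thetas.items()}
most_likely = max(log_lik, key=log_lik.get)
```

At this ambiguous point the class 1 model assigns by far the highest density, followed by class 3 and then class 2, which is consistent with the class 1 model covering a larger area of the simplex.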

FIGS. 8a-d show the classification results for all the methods presented. FIGS. 8a and 8b show results for Method-ℙ(c|z_(1:k))-w/o-model and Method-ℙ(c|z_(1:k))-w-model respectively. The former method, which does not apply a classifier model, incorrectly indicates class 2 as the most likely, because the classifier outputs often show class 2 as the most likely (see FIG. 7f). With a classifier model, the results show either class 1 or class 3 as being most probable. This can be explained by the likelihood vector ℒ(γ) from Eq. (17), which projects the γ's from different images approximately onto different simplex edges (e.g. γ₂ and γ₄ for class 1, and γ₃ and γ₅ for class 3).

FIGS. 8c and 8d present results (i.e., the posterior class probabilities) for the two methods Method-ℙ(λ_(k)|z_(1:k))-SS and Method-ℙ(λ_(k)|z_(1:k))-AP, expectation and standard deviation respectively. FIG. 8c correctly presents class 1 as most likely for both methods from k=2 onwards, and the results are smoother than in FIG. 8b because our method uses a point cloud of γ's for each image and thus takes into account multiple realizations of each of γ₁ to γ₁₀. In addition, the standard deviation of λ_(k), representing the posterior uncertainty, can be analyzed as in FIG. 8d. Note that starting from the 4th image the uncertainty increases, as later measurement likelihoods do not agree with λ_(k−1) about the most likely class at those time steps, similar to the example presented in FIG. 2. Importantly, the results for Method-ℙ(λ_(k)|z_(1:k))-SS are similar to those for Method-ℙ(λ_(k)|z_(1:k))-AP, while offering significantly shorter computation times.

FIGS. 9a and 9b present the computational time comparison between the two methods for the scenario presented in this section, including different numbers of samples N_(ss,n) per time step. FIG. 9a shows a computational time comparison between Method-ℙ(λ_(k)|z_(1:k))-AP and Method-ℙ(λ_(k)|z_(1:k))-SS per time step, with N_(ss,n) ∈ {50, 100, 200, 400} points per time step for Method-ℙ(λ_(k)|z_(1:k))-SS. FIG. 9b shows the statistical mean square error (MSE) of Method-ℙ(λ_(k)|z_(1:k))-SS relative to Method-ℙ(λ_(k)|z_(1:k))-AP, as a function of N_(ss,n) ∈ [50, 500]. Importantly, the results for Method-ℙ(λ_(k)|z_(1:k))-SS are similar to those for Method-ℙ(λ_(k)|z_(1:k))-AP while offering significantly shorter computation times. Note also that the computational time per step is constant for Method-ℙ(λ_(k)|z_(1:k))-SS. As expected, larger N_(ss,n) values produce lower MSE.

Processing elements of the system described herein may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof. Such elements can be implemented as a computer program product, tangibly embodied in an information carrier, such as a non-transient, machine-readable storage device, for execution by, or to control the operation of, data processing apparatus, such as a programmable processor or a computer, or deployed to be executed on multiple computers at one site or distributed across multiple sites. Memory storage for software and data may include one or more memory units, including one or more types of storage media. Examples of storage media include, but are not limited to, magnetic media, optical media, and integrated circuits such as read-only memory devices (ROM) and random access memory (RAM). Network interface modules may control the sending and receiving of data packets over networks. Method steps associated with the system and process can be rearranged and/or one or more such steps can be omitted to achieve the same, or similar, results to those described herein. It is to be understood that the embodiments described hereinabove are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove.

1. A method of classifying an object appearing in k multiple sequential images z_(1:k) of a scene, comprising:
A) determining, from a training set of training images of objects, a neural network (NN) classifier having M object classes for classifying objects in images;
B) determining a likelihood classifier model ℒ_(i)(γ_(k)) for each of the M object classes, and a likelihood vector ℒ(γ_(k)) ≐ [ℒ₁(γ_(k)) . . . ℒ_(M)(γ_(k))], wherein each ℒ_(i)(γ_(k)) is a probability density function (PDF) of a class probability vector γ_(t) defined as γ_(t) ≐ [γ_(t)^(1) . . . γ_(t)^(i) . . . γ_(t)^(M)], wherein each element γ_(t)^(i) is the probability of a class of an object being i, given an image z_(t);
C) for each image z_(t) of the k images, running the image multiple respective times through the NN classifier, applying dropout each time to modify weights of the NN classifier, to generate a point cloud {γ_(t)} of multiple γ_(t) values, and for each of the multiple γ_(t) values, calculating a vector λ_(t) of posterior distributions λ_(t)^(i) for each class, i=1:M, where λ_(t) ≐ [λ_(t)^(1) . . . λ_(t)^(i) . . . λ_(t)^(M)], wherein each λ_(t)^(i) is the probability of an object being of class i, given the history of images z_(1:t), wherein calculating each element λ_(t)^(i) of the vector λ_(t) comprises multiplying the values of all ℒ_(i)(γ_(t)), for all i=1:M, by each element of a posterior distribution of a prior image λ_(t−1)^(i), such that λ_(t)^(i) is proportional to ℒ_(i)(γ_(t))λ_(t−1)^(i), wherein the posterior distribution of λ_(t−1)^(i) has N_(t−1) points and the distribution of ℒ_(i)(γ_(t)) has N_(t) points, such that the distribution of {λ_(t)} has N_(t−1)×N_(t) points;
D) randomly selecting a subset of N_(ss,n) points of {λ_(t)} to form a new subset {λ_(t)}, wherein N_(ss,n) is a preset maximum number of elements of {λ_(t)} for each image; and
E) repeating steps C and D with the new subset {λ_(t)}, for each of the t=1:k images, to determine a cloud of posterior probability vectors {λ_(k)}.
2. The method of claim 1, further comprising calculating an expectation E(λ_(t−1)^(i)) for each of the distributions of λ_(t)^(i) of the cloud of posterior probability vectors {λ_(k)}.
3. The method of claim 2, further comprising calculating a variance √(Var(λ_(k)^(i))), corresponding to a classifier model uncertainty, for each of the distributions of λ_(k)^(i) of the cloud of posterior probability vectors {λ_(k)}.
4. The method of claim 1, wherein each ℒ_(i)(γ_(t)) is a Dirichlet distributed classifier model.
5. The method of claim 1, wherein the cloud of posterior probability vectors {λ_(k)} is an approximation of a distribution over posterior class probabilities given all the multiple sequential images, ℙ(λ_(k)|z_(1:k)).
6. The method of claim 5, wherein the distribution over posterior class probabilities given all the k multiple sequential images, ℙ(λ_(k)|z_(1:k)), accumulates model uncertainty data from all ℙ(γ_(t)|z_(t)) for all respective time steps t corresponding to a first through a last of the k images.
7. The method of claim 5, wherein a highest probability class being the same for both ℙ(λ_(k−1)|z_(1:k−1)) and {ℒ_(i)(γ_(k))} determines that ℙ(λ_(k)|z_(1:k)) has a smaller spread compared to ℙ(λ_(k−1)|z_(1:k−1)).
8. The method of claim 5, wherein a highest probability class being the same for both ℙ(λ_(k−1)|z_(1:k−1)) and {ℒ_(i)(γ_(k))} determines a high probability of an expectation of ℙ(λ_(k)|z_(1:k)) being the highest probability class.
9. The method of claim 5, wherein if only one of ℙ(λ_(k−1)|z_(1:k−1)) and {ℒ_(i)(γ_(k))} is near a simplex center, ℙ(λ_(k)|z_(1:k)) will be similar to the one farther from the simplex center.
10. The method of claim 1, wherein each ℒ_(i)(γ_(k)) is trained using images of instances of objects of class c=i and a corresponding classifier output γ_(t)^(i).